Feature Analysis for Semantic Clustering of Sequence Documents

Author

  • V. Bhuvaneswari
Abstract

Sequence data maintained in public databases are available in heterogeneous formats such as FASTA, XML, and ASN.1. The XML representation is itself heterogeneous, because different databases use different DTDs; the difference lies in how the sequence description is represented as XML tags. The protein and genomic data in XML format in GenBank use more than 3500 tags to represent the functional description. Sequence documents extracted in any available format carry a large amount of information about the sequences. Each sequence record has information such as its description, alternate names, gene-id, object-id, length, taxon, database references, and so on. Analyzing the sequence description to understand the biological process becomes complex because of the large number of attributes. Feature selection methods can be applied to select the relevant attributes for genomic and protein datasets. The focus of the study is to select relevant sequence attributes using feature selection methods and to group the documents semantically.
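The idea of selecting relevant attributes and then grouping documents can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the study: the abstract does not name a feature selection technique, so the filter below (drop tags whose value is constant or unique across records) and the greedy Jaccard grouping are stand-ins, and the records, tag names, and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy records; the tag names are illustrative, not a real DTD.
RECORDS = {
    "s1": "<seq><taxon>human</taxon><function>kinase</function><object-id>1</object-id></seq>",
    "s2": "<seq><taxon>human</taxon><function>kinase</function><object-id>2</object-id></seq>",
    "s3": "<seq><taxon>mouse</taxon><function>transporter</function><object-id>3</object-id></seq>",
    "s4": "<seq><taxon>mouse</taxon><function>transporter</function><object-id>4</object-id></seq>",
}


def attributes(xml_text):
    """Map each leaf tag of one record to its stripped text value."""
    root = ET.fromstring(xml_text)
    return {
        el.tag: el.text.strip()
        for el in root.iter()
        if len(el) == 0 and el.text and el.text.strip()
    }


def select_tags(docs):
    """Assumed filter: keep tags whose values are neither constant nor unique per record."""
    values = {}
    for attrs in docs.values():
        for tag, val in attrs.items():
            values.setdefault(tag, []).append(val)
    n = len(docs)
    return {tag for tag, vals in values.items() if 1 < len(set(vals)) < n}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group(docs, tags, threshold=0.5):
    """Greedy single-pass grouping on the selected (tag, value) pairs."""
    clusters = []  # each entry: (representative feature set, member ids)
    for doc_id, attrs in docs.items():
        feats = {(t, v) for t, v in attrs.items() if t in tags}
        for rep, members in clusters:
            if jaccard(feats, rep) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((feats, [doc_id]))
    return [members for _, members in clusters]


docs = {doc_id: attributes(xml) for doc_id, xml in RECORDS.items()}
tags = select_tags(docs)
groups = group(docs, tags)
print(sorted(tags), groups)
```

In this toy data the `object-id` tag is unique per record, so the filter discards it, and the remaining tags separate the records into two groups. With thousands of real tags, a statistical or ontology-based selection method would likely replace this simple filter.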


Similar resources

Integrated Clustering and Feature Selection Scheme for Text Documents

Problem statement: Text documents are unstructured databases that contain raw data collections. Clustering techniques are used to group text documents by their similarity. Approach: Feature selection techniques were used to improve the efficiency and accuracy of the clustering process. Feature selection was done by eliminating the redundant and irrelevant items from th...


Semantic Clustering of Genomic Documents Using Go Terms as Feature Set

Biological databases generate a huge volume of genomics and proteomics data. The sequence information is used by researchers to find similarities between genes and proteins and to find other related information. The genomic sequence database consists of a large number of attributes as annotations that define the sequences in XML format. It is necessary to have a proper mechanism to group the d...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Statistical and Semantic Feature Selection for Text Clustering

Organizing textual documents by categorizing them is important and beneficial for information retrieval, but when it comes to clustering documents containing a huge number of terms, the task becomes challenging. Therefore, selecting effective features is essential for reducing the dimensionality of the feature space and improving clustering performance. While numerous methods have been developed fo...


Review on Text Clustering Using Statistical and Semantic Data

The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools to acquire useful information, such as text mining. Document clustering is one of its powerful methods, by which document retrieval, organization, and summarization can be achieved. Text documents are unstructured databases that contain raw data collections. The clustering...


CLCL-A Clustering Algorithm Based on Lexical Chain for Large-Scale Documents

With the explosion of information, how to cluster large-scale documents has become more and more important. This paper proposes a novel document clustering algorithm (CLCL) to solve this problem. The algorithm first constructs lexical chains from the feature space to reflect the different topics that the input documents contain, and the documents can then be separated into clusters by these lexical chains....




Publication date: 2012